Abstract

This paper analyzes the problem of Gaussian process (GP) bandits with deterministic observations.
The analysis uses a branch and bound algorithm that is related to the UCB algorithm of (Srinivas
et al., 2010). For GPs with Gaussian observation noise, with variance strictly greater than zero,
(Srinivas et al., 2010) proved that the regret vanishes at the approximate rate of $O(1/\sqrt{t})$, where $t$
is the number of observations. To complement their result, we attack the deterministic case and
attain a much faster exponential convergence rate. Under some regularity assumptions, we show
that the regret decreases asymptotically according to $O\left(e^{-\tau t/(\ln t)^{d/4}}\right)$ with high probability. Here,
$d$ is the dimension of the search space and $\tau$ is a constant that depends on the behaviour of the
objective function near its global maximum.

1. Introduction

Let $f : \mathcal{D} \to \mathbb{R}$ be a function on a compact subset $\mathcal{D} \subseteq \mathbb{R}^d$. We would like to address the global
optimization problem

$$x_M = \operatorname*{argmax}_{x \in \mathcal{D}} f(x).$$

Let us assume for the sake of simplicity that the objective function $f$ has a unique global maximum (although
it may have many local maxima).

The space $\mathcal{D}$ might be the set of free parameters that one could feed into a time-consuming algorithm or
the locations where a sensor could be deployed, and the function $f$ might be a measure of the performance
of the algorithm (e.g. how long it takes to run). We refer the reader to (Mockus, 1982; Schonlau et al.,
1998; Gramacy et al., 2004; Brochu et al., 2007; Lizotte, 2008; Martinez-Cantin et al., 2009; Garnett et al.,
2010) for many practical examples of this global optimization setting. In this paper, our assumption is that
once the function has been probed at a point $x \in \mathcal{D}$, the value $f(x)$ can be observed with very high
precision. This is the case when the deployed sensors are very accurate or if the algorithm is deterministic.
An example of this is the configuration of CPLEX parameters in mixed-integer programming (Hutter et al.,
2010). More ambitiously, we might be interested in the simultaneous automatic configuration of an entire
system (algorithms, architectures and hardware) whose performance is deterministic in terms of several free
parameters and design choices.

Figure 1. An example of the Lipschitz hypothesis being used to discard pieces of the search space when finding the
maximum of a function $f$. Although $f$ is only known at the red sample points, if the derivative upper bounds (dashed
lines) are below the best attained value thus far, $f(x^+)$, the corresponding areas of the search space (shaded regions)
may be discarded.

Global optimization is a difficult problem without any assumptions on the objective function $f$. The main
complicating factor is the uncertainty over the extent of the variations of $f$, e.g. one could consider the
characteristic function, which is equal to 1 at $x_M$ and 0 elsewhere, and none of the methods we mention here
can optimize this function without exhaustively searching through every point in $\mathcal{D}$.

The way a large number of global optimization methods address this problem is by imposing some prior
assumption on how fast the objective function $f$ can vary. The most explicit manifestation of this remedy is
the imposition of a Lipschitz assumption on $f$, which requires the change in the value of $f(x)$, as the point $x$
moves around, to be smaller than a constant multiple of the distance traveled by $x$ (Hansen et al., 1992). As
pointed out in (Bubeck et al., 2011, Figure 3), it is only important to have this kind of tight control over the
function near its optimum: elsewhere in the space, we can have what they have dubbed a "weak Lipschitz"
condition.

One way to relax these hard Lipschitz constraints is by putting a Gaussian Process (GP) prior on the function. 
Instead of restricting the function from oscillating too fast, a GP prior requires those fast oscillations to have 
low probability, cf. (Ghosal & Roy, 2006, Theorem 5). 

The main point of these bounds (be they hard or soft) is to assist with the exploration-exploitation trade-off
that global optimization algorithms have to grapple with. In the absence of any assumptions of convexity on
the objective function, a global optimization algorithm is forced to explore enough until it reaches a point in
the process when with some degree of certainty it can localize its search space and perform local optimization
(exploitation). Derivative bounds such as the ones discussed here, together with the boundedness of the search
space, guaranteed by the compactness assumption on $\mathcal{D}$, provide us with such certainty by producing a useful
upper bound that allows us to shrink the search space. This is illustrated in Figure 1. Suppose we know that 
our function is Lipschitz with constant L, then given sample points as shown in the figure, we can use the 
Lipschitz property to discard pieces of the search space. This is done by finding points in the search space 
where the function could not possibly be higher than the maximum value already encountered. Such points 
are found by placing cones at the sampled points with slope equal to L and checking where those cones lie 
below the maximum observed value. 
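
The cone construction just described can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the one-dimensional test function, the grid, the three sample locations and the names `lipschitz_upper_bound` and `surviving_points` are hypothetical and do not come from the paper.

```python
def lipschitz_upper_bound(x, samples, lipschitz_constant):
    """Tightest upper bound on f(x) implied by the samples and the constant L."""
    return min(fx + lipschitz_constant * abs(x - xs) for xs, fx in samples)


def surviving_points(grid, samples, lipschitz_constant):
    """Keep the grid points whose upper bound still reaches the best observed value."""
    best = max(fx for _, fx in samples)
    return [x for x in grid
            if lipschitz_upper_bound(x, samples, lipschitz_constant) >= best]


def objective(x):
    # Hypothetical objective on [0, 1]; |f'(x)| <= 1.4 there, so L = 1.4 is valid.
    return -(x - 0.3) ** 2


samples = [(x, objective(x)) for x in (0.0, 0.5, 1.0)]
grid = [i / 20 for i in range(21)]
kept = surviving_points(grid, samples, 1.4)
```

In this toy run the end points of the interval are discarded, while the neighbourhood of the true maximizer $x = 0.3$ survives.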

This crude approach is wasteful because very often the slope of the function is much smaller than $L$. As we
will see below (cf. Figure 2), GPs do a better job of providing lower and upper bounds that can be used to
limit the search space, by essentially choosing Lipschitz constants that vary over the search space and the
algorithm run time.

Figure 2. An example of our branch and bound maximization algorithm with UCB surrogate $\mu + B\sigma$, where $\mu$ and $\sigma$
are the mean and standard deviation of the GP respectively. The region consisting of the points $x$ for which the upper
confidence bound $\mu(x) + B\sigma(x)$ is lower than the maximum value of the lower confidence bound $\mu(x) - B\sigma(x)$ does
not need to be sampled anymore. Note that the UCB surrogate function bounds $f$ from above.



We also assume that the objective function $f$ is costly to evaluate (e.g. time-wise or financially). We would
like to avoid probing $f$ as much as possible and to get close to the optimum as quickly as possible. A
solution to this problem is to approximate $f$ with a surrogate function that provides a good upper bound for
$f$ and which is easier to calculate and optimize. Surrogate functions can also aid with global optimization
by restricting the domain of interest.

GPs enable us to construct surrogate functions, which are relatively easy to evaluate and optimize. We refer 
the reader to (Brochu et al., 2009) for a general review of the literature on the various surrogate functions 
utilized in GP bandits in the context of Bayesian optimization. 

The surrogate function that we will make extensive use of here is called the Upper Confidence Bound (UCB).
It is defined to be $\mu + B\sigma$, where $\mu$ and $\sigma$ are the posterior predictive mean and standard deviation of the
GP and $B$ is a constant to be chosen by the algorithm. This surrogate function has been studied extensively
in the literature and this paper relies heavily on the ideas put forth in (Srinivas et al., 2010),
in which the algorithm consists of repeated optimization of the UCB surrogate function after
each sample.

One key difference between our setting and that of (Srinivas et al., 2010) is that, whereas we assume that
the value of the function can be observed exactly, in (Srinivas et al., 2010) it is necessary for the noise to be
non-trivial (and Gaussian) because the main quantity that is used in the estimates, namely information gain,
cf. (Srinivas et al., 2010, Equation 3), becomes undefined when the variance of the observation noise ($\sigma^2$ in
their notation) is set to 0, cf. the expression for $I(y_A; f_A)$ that was given in the paragraph following Equation
(3). So, their setting is complementary to ours. Moreover, we show that the regret, $r(x_t) = \max_{\mathcal{D}} f - f(x_t)$,
decreases according to $O\left(e^{-\tau t/(\ln t)^{d/4}}\right)$, implying that the cumulative regret is bounded from above.



The paper whose results are most similar to ours is (Munos, 2011), but there are some key differences in
the methodology, analysis and obtained rates. For instance, we are interested in cumulative regret, whereas
the results of (Munos, 2011) are proven for finite stop-time regret. In our case, the ideal application is the
optimization of a function that is $C^2$-smooth and has an unknown non-singular Hessian at the maximum. We
obtain a regret rate $O\left(e^{-\tau t/(\ln t)^{d/4}}\right)$, whereas the DOO algorithm in (Munos, 2011) has regret rate $O(e^{-t})$ if
the Hessian is known and the SOO algorithm has regret rate $O(e^{-\sqrt{t}})$ if the Hessian is unknown. In addition,
the algorithms in (Munos, 2011) can handle functions that behave like $-c\|x - x_M\|^{\alpha}$ near the maximum
(cf. Example 2 therein). This problem was also studied by (Vazquez & Bect, 2010) and (Bull, 2011), but
using the Expected Improvement surrogate instead of UCB. Our methodology and results are different, but
complementary to theirs.



2. Gaussian process bandits 
2.1. Gaussian processes 

As in (Srinivas et al., 2010), the objective function is distributed according to a Gaussian process prior:

$$f(x) \sim \mathrm{GP}(m(\cdot), \kappa(\cdot,\cdot)). \tag{1}$$

For convenience, and without loss of generality, we assume that the prior mean vanishes, i.e., $m(\cdot) = 0$.
There are many possible choices for the covariance kernel. One obvious choice is the anisotropic kernel $\kappa$
with a vector of known hyperparameters (Rasmussen & Williams, 2006):

$$\kappa(x_i, x_j) = \tilde{\kappa}\left(-(x_i - x_j)^\top D (x_i - x_j)\right), \tag{2}$$

where $\tilde{\kappa}$ is an isotropic kernel and $D$ is a diagonal matrix with positive hyperparameters along the diagonal
and zeros elsewhere. Our results apply to squared exponential kernels and Matérn kernels with parameter
$\nu > 2$. In this paper, we assume that the hyperparameters are fixed and known in advance.
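
As a concrete instance of Equation (2), the sketch below evaluates an anisotropic squared exponential kernel. It assumes the isotropic profile $\tilde{\kappa}(z) = e^{z/2}$, which is one common choice and not something fixed by the text; the function name `anisotropic_se_kernel` and the example hyperparameters are hypothetical.

```python
import math


def anisotropic_se_kernel(xi, xj, d_diag):
    """kappa(xi, xj) = exp(-0.5 * (xi - xj)^T D (xi - xj)) with D = diag(d_diag).

    Assumption: the isotropic profile is kappa_tilde(z) = exp(z / 2).
    """
    quad = sum(d * (a - b) ** 2 for a, b, d in zip(xi, xj, d_diag))
    return math.exp(-0.5 * quad)


# Hypothetical hyperparameters: the second coordinate varies more slowly.
d_diag = (1.0, 0.25)
value = anisotropic_se_kernel((0.0, 0.0), (1.0, 2.0), d_diag)
```

A small diagonal entry of $D$ means that the function changes slowly along the corresponding coordinate, so points that are far apart in that direction remain strongly correlated.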

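We can sample the GP at $t$ points by choosing points $x_{1:t} := \{x_1, \ldots, x_t\}$ and sampling the values of the
function at these points to produce the vector $\mathbf{f}_{1:t} = [f(x_1) \cdots f(x_t)]^\top$. The function values are distributed
according to a multivariate Gaussian distribution $\mathcal{N}(0, \mathbf{K})$, with covariance entries $\kappa(x_i, x_j)$. Assume that we
already have several observations from previous steps, and that we want to decide what action $x_{t+1}$ should
be considered next. Let us denote the value of the function at this arbitrary new point as $f_{t+1}$. Then, by
the properties of GPs, $\mathbf{f}_{1:t}$ and $f_{t+1}$ are jointly Gaussian:

$$\begin{bmatrix} \mathbf{f}_{1:t} \\ f_{t+1} \end{bmatrix} \sim \mathcal{N}\left(\mathbf{0}, \begin{bmatrix} \mathbf{K} & \mathbf{k} \\ \mathbf{k}^\top & \kappa(x_{t+1}, x_{t+1}) \end{bmatrix}\right),$$

where $\mathbf{k} = [\kappa(x_{t+1}, x_1) \cdots \kappa(x_{t+1}, x_t)]^\top$. Using the Schur complement, one arrives at an expression for the
posterior predictive distribution:

$$P(f_{t+1} \mid x_{1:t+1}, \mathbf{f}_{1:t}) = \mathcal{N}\left(\mu_t(x_{t+1}), \sigma_t^2(x_{t+1})\right),$$

where

$$\mu_t(x_{t+1}) = \mathbf{k}^\top \mathbf{K}^{-1} \mathbf{f}_{1:t}, \qquad \sigma_t^2(x_{t+1}) = \kappa(x_{t+1}, x_{t+1}) - \mathbf{k}^\top \mathbf{K}^{-1} \mathbf{k} \tag{3}$$

and $\mathbf{f}_{1:t} = [f(x_1) \cdots f(x_t)]^\top$.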
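
The posterior mean and variance above can be evaluated directly. The following is a minimal sketch, assuming a one-dimensional squared exponential kernel with unit length scale and a plain Gaussian elimination solver in place of a numerically robust Cholesky factorization; the names `se_kernel`, `solve` and `gp_posterior` and the toy data are hypothetical.

```python
import math


def se_kernel(a, b, length_scale=1.0):
    """Squared exponential kernel in one dimension (assumed for illustration)."""
    return math.exp(-0.5 * ((a - b) / length_scale) ** 2)


def solve(matrix, rhs):
    """Solve matrix @ x = rhs by Gaussian elimination with partial pivoting."""
    n = len(rhs)
    aug = [row[:] + [rhs[i]] for i, row in enumerate(matrix)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, n):
            factor = aug[r][col] / aug[col][col]
            for c in range(col, n + 1):
                aug[r][c] -= factor * aug[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        tail = sum(aug[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (aug[r][n] - tail) / aug[r][r]
    return x


def gp_posterior(x_new, xs, fs, kernel=se_kernel):
    """Noise-free posterior mean and variance at x_new given observations (xs, fs)."""
    K = [[kernel(a, b) for b in xs] for a in xs]
    k = [kernel(x_new, a) for a in xs]
    mean = sum(ki * wi for ki, wi in zip(k, solve(K, fs)))           # k^T K^{-1} f
    var = kernel(x_new, x_new) - sum(ki * vi for ki, vi in zip(k, solve(K, k)))
    return mean, max(var, 0.0)  # clip tiny negative values caused by round-off


xs, fs = [0.0, 1.0, 2.0], [0.0, 1.0, 0.5]   # hypothetical observations
mean_at_sample, var_at_sample = gp_posterior(1.0, xs, fs)
mean_between, var_between = gp_posterior(1.5, xs, fs)
```

Because the observations are noise free, the posterior interpolates the data: at a sampled location the mean equals the observed value and the variance vanishes, while between samples the variance is strictly positive.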



2.2. Surrogates for optimization 

When it is assumed that the objective function $f$ is sampled from a GP, one can use a combination of the
posterior predictive mean and variance given by Equations (3) to construct surrogate functions, which tell
us where to sample next. Here we use the UCB combination, which is given by

$$\mu_t(x) + B_t \sigma_t(x),$$

where $\{B_t\}_{t=1}^{\infty}$ is a sequence of numbers specified by the algorithm. This surrogate trades off exploration
and exploitation since it is optimized by choosing points where the mean is high (exploitation) and where the
variance is large (exploration). Since the surrogate has an analytical expression that is easy to evaluate, it is
much easier to optimize than the original objective function. Other popular surrogate functions constructed
using the sufficient statistics of the GP include the Probability of Improvement, Expected Improvement and
Thompson sampling. We refer the reader to (Brochu et al., 2009; May et al., 2010; Hoffman et al., 2011) for
details on these.
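
To make the trade-off concrete, the sketch below picks the maximizer of the UCB over a finite candidate set for several values of $B_t$. The posterior values in the table are made up for illustration, and the names `ucb` and `next_sample` are hypothetical.

```python
def ucb(mean, std, b):
    """Upper confidence bound mu + B * sigma."""
    return mean + b * std


def next_sample(candidates, posterior, b):
    """Return the candidate maximizing the UCB; posterior(x) returns (mean, std)."""
    return max(candidates, key=lambda x: ucb(*posterior(x), b))


# Hypothetical posterior summaries (mean, std) at three candidate points.
table = {0: (1.0, 0.0), 1: (0.8, 0.3), 2: (0.2, 0.5)}
choices = {b: next_sample(table.keys(), table.get, b) for b in (0.0, 1.0, 4.0)}
```

With $B_t = 0$ the rule is purely exploitative and selects the point with the highest mean; as $B_t$ grows, the choice moves towards points with larger posterior standard deviation.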

2.3. Our algorithm 

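The main idea of our algorithm (Algorithm 1) is to tighten the bound on $f$ given by the UCB surrogate
function by sampling the search space more and more densely and shrinking this space as more and more
of the UCB surrogate function is "submerged" under the maximum of the Lower Confidence Bound (LCB).
Figure 2 illustrates this intuition.

Algorithm 1 Branch and Bound

Input: A compact subset $\mathcal{D} \subseteq \mathbb{R}^d$, a discrete lattice $\mathcal{L} \subseteq \mathcal{D}$ and a function $f : \mathcal{D} \to \mathbb{R}$.

repeat

  Sample Twice as Densely:

  • $\delta \leftarrow \delta/2$

  • Sample $f$ at enough points in $\mathcal{L}$ so that every point in $\mathcal{R}$ is contained in a simplex of size $\delta$.

  Shrink the Relevant Region:

  • Set

    $$\tilde{\mathcal{R}} := \left\{ x \in \mathcal{R} \;\middle|\; \mu_T(x) + \sqrt{\beta_T}\,\sigma_T(x) > \sup_{\mathcal{R}} \left( \mu_T(x) - \sqrt{\beta_T}\,\sigma_T(x) \right) \right\},$$

    where $T$ is the number of points sampled so far and $\beta_T = 2\ln\left(\frac{|\mathcal{L}| T^2}{\alpha}\right) = 4\ln T + 2\ln\frac{|\mathcal{L}|}{\alpha}$ with $\alpha \in (0, 1)$.

  • Solve the following constrained optimization problem:

    $$(x_1^*, x_2^*) = \operatorname*{argsup}_{(x_1, x_2) \in \tilde{\mathcal{R}} \times \tilde{\mathcal{R}}} \|x_1 - x_2\|$$

  • $\mathcal{R} \leftarrow B\left(\frac{x_1^* + x_2^*}{2}, \|x_1^* - x_2^*\|\right)$, where $B(p, r)$ is the ball of radius $r$ centred around $p$.

until $\mathcal{R} \cap \mathcal{L} = \emptyset$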

More specifically, the algorithm consists of two iterative stages. During the first stage, the function is sampled
along a lattice of points (the red crosses in Figure 3). In the second stage, the search space is shrunk to
discard regions where the maximum is very unlikely to reside. Such regions are obtained by finding points
where the UCB is lower than the LCB (the complement of the colored region in the same panel as before).
The remaining set of relevant points is denoted by $\tilde{\mathcal{R}}$. In order to simplify the task of shrinking the search
space, we simply find an enclosing ball, which is denoted by $\mathcal{R}$ in Algorithm 1. Back to the first stage, we
consider a lattice that is twice as dense as in the first stage of the previous iteration, but we only sample at
points that lie within our new smaller search space.

In the second stage, the auxiliary step of approximating the relevant set $\tilde{\mathcal{R}}$ with the ball $\mathcal{R}$ introduces
inefficiencies in the algorithm, since we only need to sample inside $\tilde{\mathcal{R}}$. This can be easily remedied in
practice to obtain an efficient algorithm. Our analysis will show that even without these improvements it is
already possible to obtain very strong exponential convergence rates. Of course, practical improvement will
result in better constants and ought to be considered seriously.
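
The shrinking stage can be sketched for a one-dimensional search space as follows. This is an illustration under stated assumptions rather than the authors' implementation: the posterior means and standard deviations are supplied as plain lists (strictly positive standard deviations, so the relevant set is non-empty), the candidate points and values are made up, and the names `beta` and `shrink_1d` are hypothetical.

```python
import math


def beta(num_samples, lattice_size, alpha):
    """beta_T = 2 ln(|L| T^2 / alpha), the confidence parameter of Algorithm 1."""
    return 2.0 * math.log(lattice_size * num_samples ** 2 / alpha)


def shrink_1d(points, mean, std, beta_t):
    """Return the relevant points and the enclosing ball (centre, radius) in 1D.

    Assumes std > 0 at every candidate, so at least one point has UCB > max LCB.
    """
    width = math.sqrt(beta_t)
    best_lcb = max(m - width * s for m, s in zip(mean, std))
    relevant = [x for x, m, s in zip(points, mean, std) if m + width * s > best_lcb]
    lo, hi = min(relevant), max(relevant)      # the two farthest relevant points
    return relevant, ((lo + hi) / 2.0, hi - lo)


# Hypothetical posterior summaries on five candidate points.
points = [0.0, 0.25, 0.5, 0.75, 1.0]
mean = [0.1, 0.6, 0.9, 0.5, 0.0]
std = [0.05] * 5
beta_t = beta(num_samples=5, lattice_size=5, alpha=0.1)
relevant, (centre, radius) = shrink_1d(points, mean, std, beta_t)
```

As in Algorithm 1, the ball is centred at the midpoint of the two farthest relevant points and its radius is their full distance, which is deliberately generous and is one source of the inefficiency mentioned above.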

3. Analysis 

3.1. Approximation results 

We begin our analysis by showing that, given sufficient explored locations, the residual variance is small.
More specifically, for any point $x$ contained in the convex hull of a set of $d$ points that are no further than $\delta$
apart from $x$, we show that the residual is bounded by $O(\|h\|_{\mathcal{H}} \delta^2)$, where $\|h\|_{\mathcal{H}}$ is the Hilbert Space norm
of the associated function, and that furthermore the residual variance is bounded by $O(\delta^2)$. We begin by
relating residual variance, projection operators, and interpolation in Hilbert Spaces. Lemmas 1, 2 and 3 are
standard. We include their proofs in the supplementary material for the purpose of being self-contained.
Proposition 4 is our key approximation result. It plays a central role in the proof of our exponential regret
bounds. Its proof, as well as the proof for the main theorem, is included in the supplementary material.




Figure 3. Branch and Bound algorithm for a 2D function. The colored region is the search space and the color-map, 
with red high and blue low, illustrates the value of the UCB. Four steps of the algorithm are shown; progressing from 
left to right and top to bottom. The green dots designate the points where the function was sampled in the previous 
steps, while the red crosses denote the freshly sampled points. 

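Lemma 1 (Hilbert Space Properties) Given a set of points $x_{1:T} := \{x_1, \ldots, x_T\} \subseteq \mathcal{D}$ and a Reproducing
Kernel Hilbert Space (RKHS) $\mathcal{H}$ with kernel $\kappa$, the following bounds hold:

1. Any $h \in \mathcal{H}$ is Lipschitz continuous with constant $\|h\|_{\mathcal{H}} L$, where $\|\cdot\|_{\mathcal{H}}$ is the Hilbert space norm and $L$
satisfies the following:

$$L^2 \le \sup_{x \in \mathcal{D}} \partial_x \partial_{x'} \kappa(x, x')\big|_{x = x'} \tag{4}$$

and for $\kappa(x, x') = \kappa(x - x')$ we have

$$L^2 \le \partial_x^2 \kappa(x)\big|_{x = 0}.$$

2. Any $h \in \mathcal{H}$ has its second derivative bounded by $\|h\|_{\mathcal{H}} Q$, where

$$Q^2 \le \sup_{x \in \mathcal{D}} \partial_x^2 \partial_{x'}^2 \kappa(x, x')\big|_{x = x'} \tag{5}$$

and for $\kappa(x, x') = \kappa(x - x')$ we have

$$Q^2 \le \partial_x^4 \kappa(x)\big|_{x = 0}.$$

3. The projection operator $P_{1:T}$ on the subspace $\operatorname{span}_{t = 1:T}\{\kappa(x_t, \cdot)\} \subseteq \mathcal{H}$ is given by

$$P_{1:T} h := \mathbf{k}^\top(\cdot)\, \mathbf{K}^{-1} \langle \mathbf{k}(\cdot), h \rangle \tag{6}$$

where $\mathbf{k}(\cdot) = \mathbf{k}_{1:T}(\cdot) := [\kappa(x_1, \cdot) \cdots \kappa(x_T, \cdot)]^\top$ and $\mathbf{K} := [\kappa(x_i, x_j)]_{i,j = 1:T}$; moreover, we have that

$$\langle \mathbf{k}(\cdot), h \rangle := \begin{bmatrix} \langle \kappa(x_1, \cdot), h \rangle \\ \vdots \\ \langle \kappa(x_T, \cdot), h \rangle \end{bmatrix} = \begin{bmatrix} h(x_1) \\ \vdots \\ h(x_T) \end{bmatrix}.$$

Here $P_{1:T} P_{1:T} = P_{1:T}$ and $\|P_{1:T}\| \le 1$ and $\|1 - P_{1:T}\| \le 1$.

4. Given sets $x_{1:T} \subseteq x_{1:T'}$ it follows that $\|P_{1:T} h\|_{\mathcal{H}} \le \|P_{1:T'} h\|_{\mathcal{H}} \le \|h\|_{\mathcal{H}}$.

5. Given tuples $(x_i, h_i)$ with $h_i = h(x_i)$, the minimum norm interpolation $\bar{h}$ with $\bar{h}(x_i) = h(x_i)$ is given by
$\bar{h} = P_{1:T} h$. Consequently its residual $g := (1 - P_{1:T}) h$ satisfies $g(x_i) = 0$ for all $x_i \in x_{1:T}$.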



Lemma 2 (GP Variance) Under the assumptions of Lemma 1 it follows that

$$|h(x) - P_{1:T} h(x)| \le \|h\|_{\mathcal{H}}\, \sigma_T(x), \tag{7}$$

where $\sigma_T^2(x) = \kappa(x, x) - \mathbf{k}_{1:T}^\top(x)\, \mathbf{K}^{-1} \mathbf{k}_{1:T}(x)$ and this bound is tight. Moreover, $\sigma_T^2(x)$ is the residual variance
of a Gaussian process with the same kernel.

Lemma 3 (Approximation Guarantees) We denote by $x_{1:T} \subseteq \mathcal{D}$ a set of locations and assume that
$g(x_i) = 0$ for all $x_i \in x_{1:T}$.

1. Assume that $g$ is Lipschitz continuous with bound $L$. Then $g(x) \le L\, d(x, x_{1:T})$, where $d(x, x_{1:T})$ is the
minimum distance $\|x - x_i\|$ between $x$ and any $x_i \in x_{1:T}$.

2. Assume that $g$ has its second derivative bounded by $Q'$. Moreover, assume that $x$ is contained inside the
convex hull of $x_{1:T}$ such that the smallest such convex hull has a maximum pairwise distance between
vertices of $\delta$. Then we have $g(x) \le \frac{1}{4} Q' \delta^2$.



Proposition 4 (Variance Bound) Let $\kappa : \mathbb{R}^d \times \mathbb{R}^d \to \mathbb{R}$ be a kernel that is four times differentiable along
the diagonal $\{(x, x) \mid x \in \mathbb{R}^d\}$, with $Q$ defined as in Lemma 1.2, and $f \sim \mathrm{GP}(0, \kappa(\cdot, \cdot))$ a sample from the
corresponding Gaussian Process. If $f$ is sampled at points $x_{1:T} = \{x_1, \ldots, x_T\}$ that form a $\delta$-cover of a
subset $\mathcal{D} \subseteq \mathbb{R}^d$, then the resulting posterior predictive standard deviation $\sigma_T$ satisfies

$$\sup_{\mathcal{D}} \sigma_T \le \frac{Q \delta^2}{4}.$$
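
A quick numerical sanity check of this bound is sketched below for a one-dimensional squared exponential kernel $\kappa(x, x') = e^{-(x - x')^2/(2\ell^2)}$. For this kernel the fourth derivative at zero is $3/\ell^4$, so we take $Q = \sqrt{3}/\ell^2$ in the spirit of Lemma 1.2. The interval $[0, 2]$, the length scale, the evaluation grid and all function names are assumptions made for illustration; the check is not part of the paper's argument.

```python
import math

ELL = 1.0  # assumed length scale of the squared exponential kernel


def kernel(a, b):
    return math.exp(-0.5 * ((a - b) / ELL) ** 2)


def solve(matrix, rhs):
    """Solve matrix @ x = rhs by Gaussian elimination with partial pivoting."""
    n = len(rhs)
    aug = [row[:] + [rhs[i]] for i, row in enumerate(matrix)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, n):
            factor = aug[r][col] / aug[col][col]
            for c in range(col, n + 1):
                aug[r][c] -= factor * aug[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        tail = sum(aug[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (aug[r][n] - tail) / aug[r][r]
    return x


def posterior_std(x, xs):
    """Noise-free posterior standard deviation at x given sample locations xs."""
    K = [[kernel(a, b) for b in xs] for a in xs]
    k = [kernel(x, a) for a in xs]
    var = kernel(x, x) - sum(ki * wi for ki, wi in zip(k, solve(K, k)))
    return math.sqrt(max(var, 0.0))


def check(delta, width=2.0):
    """Return (sup of sigma_T on a fine grid, Q * delta^2 / 4) for grid spacing delta."""
    n = round(width / delta)
    xs = [i * delta for i in range(n + 1)]
    fine = [i * width / 200 for i in range(201)]
    sup_std = max(posterior_std(x, xs) for x in fine)
    bound = math.sqrt(3.0) / ELL ** 2 * delta ** 2 / 4.0
    return sup_std, bound


results = {delta: check(delta) for delta in (1.0, 0.5)}
```

In both configurations the largest posterior standard deviation found on the evaluation grid stays below $Q\delta^2/4$, as the proposition predicts.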



3.2. Finiteness of regret 

Having shown that the variance vanishes according to the square of the resolution of the lattice of sampled 
points, we now move on to show that this estimate implies an exponential asymptotic vanishing of the regret 
encountered by our Branch and Bound algorithm. This is laid out in our main theorem stated below and 
proven in the supplementary material. 

The theorem considers a function /, which is a sample from a GP with a kernel that is four times differentiable 
along its diagonal. The global maximum of / can appear in the interior of the search space, with the function 
being twice differentiable at the maximum and with non-vanishing curvature. Alternatively, the maximum 
can appear on the boundary with the function having non-vanishing gradient at the maximum. Given 
a lattice that is fine enough, the theorem asserts that the regret asymptotically decreases in exponential 
fashion. 

The main idea of the proof of this theorem is to use the bound on $\sigma$ given by Proposition 4 to reduce the size
of the search space. The key assumption about the function that the proof utilizes is the quadratic upper
bound on the objective function $f$ near its global maximum, which together with Proposition 4 allows us to
shrink the relevant region $\mathcal{R}$ in Algorithm 1 rapidly. The figures in the proof give a picture of this idea. The
only complicating factor is the factor $\sqrt{\beta_t}$ in the expression for the UCB that needs to be estimated. This
is dealt with by modeling the growth in the number of points sampled in each iteration with a difference
equation and finding an approximate solution of that equation.

Recall that $\mathcal{D} \subseteq \mathbb{R}^d$ is assumed to be a non-empty compact subset and $f$ a sample from the Gaussian
Process $\mathrm{GP}(0, \kappa(\cdot, \cdot))$ on $\mathcal{D}$. Moreover, in what follows we will use the notation $x_M := \operatorname*{argmax}_{x \in \mathcal{D}} f(x)$. Also,
by convention, for any set $S$, we will denote its interior by $S^\circ$, its boundary by $\partial S$ and if $S$ is a subset of
$\mathbb{R}^d$, then $\operatorname{conv}(S)$ will denote its convex hull. The following holds true:

Theorem 5 Suppose we are given:

1. $\alpha > 0$, a compact subset $\mathcal{D} \subseteq \mathbb{R}^d$, and $\kappa$ a stationary kernel on $\mathbb{R}^d$ that is four times differentiable;

2. $f \sim \mathrm{GP}(0, \kappa)$ a continuous sample on $\mathcal{D}$ that has a unique global maximum $x_M$, which satisfies one of
the following two conditions:

(†) $x_M \in \mathcal{D}^\circ$ and $f(x_M) - c_1\|x - x_M\|^2 < f(x) \le f(x_M) - c_2\|x - x_M\|^2$ for all $x$ satisfying $x \in B(x_M, \rho_0)$
for some $\rho_0 > 0$;

(‡) $x_M \in \partial\mathcal{D}$ and both $f$ and $\partial\mathcal{D}$ are smooth at $x_M$, with $\nabla f(x_M) \ne 0$;

3. any lattice $\mathcal{L} \subseteq \mathcal{D}$ satisfying the following two conditions

• $2\mathcal{L} \cap \operatorname{conv}(\mathcal{L}) \subseteq \mathcal{L}$ (8)

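• $2^{\left\lceil -\log_2 \frac{\rho_0}{\operatorname{diam}(\mathcal{D})} \right\rceil + 1} \mathcal{L} \cap \mathcal{L} \ne \emptyset$ (9)

if $f$ satisfies (†).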

Then, there exist positive numbers $A$ and $\tau$ and an integer $T$ such that the points specified by the Branch
and Bound algorithm, $\{x_t\}$, will satisfy the following asymptotic bound: For all $t > T$, with probability $1 - \alpha$
we have

$$r(x_t) < A e^{-\frac{\tau t}{(\ln t)^{d/4}}}.$$

We would like to make a few clarifying remarks about the theorem. First, note that for a random sample
$f \sim \mathrm{GP}(0, \kappa)$ one of conditions (†) and (‡) will be satisfied almost surely if $\kappa$ is a Matérn kernel with $\nu > 2$
or the squared exponential kernel, because the sample $f$ is twice differentiable almost surely (by (Adler &
Taylor, 2007, Theorem 1.4.2) and (Stein, 1999, §2.6)) and the vanishing of at least one of the eigenvalues of
the Hessian is a co-dimension 1 condition in the space of all functions that are smooth at a given point, so it
has zero chance of happening at the global maximum. Second, the two conditions (8) and (9) simply require
that the lattice be "divisible by 2" and that it be fine enough so that the algorithm can sample inside the
ball $B(x_M, \rho_0)$ when the maximum of the function is located in the interior of the search space $\mathcal{D}$. Finally, it
is important to point out that the decay rate $\tau$ does not depend on the choice of the lattice $\mathcal{L}$, even though,
as stated, the theorem chooses $\tau$ only after $\mathcal{L}$ is specified. The theorem was written this
way simply for the sake of readability.

Given the exponential rate of convergence we obtain in Theorem 5, we have the following finiteness conclusion 
for the cumulative regret accrued by our Branch and Bound algorithm: 

Corollary 6 Given $\kappa$, $f \sim \mathrm{GP}(0, \kappa)$ and $\mathcal{L} \subseteq \mathcal{D}$ as in Theorem 5, the cumulative regret is bounded from
above.
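
One way to see why the rate in Theorem 5 yields a finite sum is sketched below. This is an illustrative argument under the theorem's high-probability event, not necessarily the proof intended by the authors; the threshold $T' \ge T$ is a hypothetical constant chosen so that $(\ln t)^{d/4} \le \sqrt{t}$ for all $t \ge T'$, which is possible because $\ln t$ grows more slowly than any positive power of $t$.

```latex
\begin{align*}
\sum_{t \ge T'} r(x_t)
  &< \sum_{t \ge T'} A\, e^{-\tau t/(\ln t)^{d/4}}
   \le \sum_{t \ge T'} A\, e^{-\tau \sqrt{t}}
   && \text{since } (\ln t)^{d/4} \le \sqrt{t} \text{ for } t \ge T' \\
  &\le A \int_0^{\infty} e^{-\tau \sqrt{s}}\, ds
   = 2A \int_0^{\infty} u\, e^{-\tau u}\, du
   = \frac{2A}{\tau^2} < \infty
   && \text{substituting } s = u^2 .
\end{align*}
```

The finitely many terms with $t < T'$ contribute a finite amount as well, because $f$ is continuous on the compact set $\mathcal{D}$ and so each regret is at most $\max_{\mathcal{D}} f - \min_{\mathcal{D}} f$.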

Remark 7 It is worth pointing out the trivial observation that using a simple UCB algorithm with
monotonically increasing and unbounded factor $\sqrt{\beta_t}$, without any shrinking of the search space as we do here,
necessarily leads to unbounded cumulative regret since eventually $\sqrt{\beta_t}$ becomes large enough so that at points
$x'$ far away from the maximum, $\sqrt{\beta_t}\,\sigma_t(x')$ becomes larger than $f(x_M) - f(x')$. In fact, eventually the UCB
algorithm will sample every point in the lattice $\mathcal{L}$.



4. Discussion 

In this paper we proposed a modification of the UCB algorithm of (Srinivas et al., 2010) which addresses the
noise free case. The key difference is that while the original algorithm achieves an $O(1/\sqrt{t})$ rate of convergence
to the regret minimizer, we obtain an exponential rate in the number of function evaluations. In other words,
the noise free problem is significantly easier, statistically speaking, than the noisy case. The key difference
is that we need not invest any samples in noise reduction to determine whether our observations deviate far
from their expectation.

This allows us to discard pieces of the search space where the maximum is very unlikely to be, when compared
to (Srinivas et al., 2010). We show that this additional step leads to a considerable improvement of the regret
accrued by the algorithm. In particular, the cumulative regret obtained by our Branch and Bound algorithm 
is bounded from above, whereas the cumulative regret bound obtained in the noisy bandit algorithm is 
unbounded. The possibility of dispensing with chunks of the search space can also be seen in the works 
involving hierarchical partitioning, e.g. (Munos, 2011), where regions of the space are deemed as less worthy 
of probing as time goes on. 

Our results mirror the observation in active learning that noise free and large margin learning of half spaces 
can be achieved much more rapidly than identifying a linear separator in the noisy case (Bshouty & Wattad, 
2006; Dasgupta et al., 2009). This is also reflected in classical uniform convergence results for supervised 
learning (Audibert & Tsybakov, 2007; Vapnik, 1998) where the achievable rate depends on the decay of 
probability mass near the margin. 

This suggests that the ability to extend our results to the noisy case is somewhat limited. An indication of 
what might be possible can be found in (Balcan et al., 2009), where regions of the version space are eliminated 
once they can be excluded with sufficiently high probability. One could model a corresponding Branch and 
Bound algorithm, which dispenses with points that lie outside the current (or perhaps the previous) relevant 
set when calculating the covariance matrix K in the posterior equations (3). Analysis of how much of an 
effect such a computational cost-cutting measure would have on the regret encountered by the algorithm is 
a subject of future research. 

We believe that an exciting extension can be found in guarantees for contextual bandits. Note, however, 
that the unpredictability of the context introduces new difficulties in terms of speed of convergence that 
need to be overcome. For instance, parameters for infrequent contexts will be estimated slowly unless there 
are strong correlations among contexts.

5. Proofs

5.1. Approximation Results 

Proof [Lemma 1] We prove the claims in sequence.

1. This follows from Corollary 4.36 in (Steinwart & Christmann, 2008), with $|\alpha| = 1$.

2. Same as above, just with $|\alpha| = 2$.

3. For any operator $V$ with full column rank the projection on the image of $V$ is given by $V(V^\top V)^{-1}V^\top$.
The operator $V$ in the above case is given by the stacked vector of evaluation functionals
$\kappa(x_1, \cdot), \ldots, \kappa(x_T, \cdot)$. This provides us with $P_{1:T}$. The remaining claims are standard linear algebra.

4. Projection operators satisfy $\|P_{1:T}\| \le 1$. This proves the second claim. The first claim can be seen from
the fact that projecting on a subspace can only have a smaller norm than the superspace projection.

5. We first show that the projection is an interpolation. This follows from

$$\bar{h}(x_i) = P_{1:T} h(x_i) = \langle P_{1:T} h, \kappa(x_i, \cdot) \rangle = \langle h, P_{1:T} \kappa(x_i, \cdot) \rangle = \langle h, \kappa(x_i, \cdot) \rangle = h(x_i).$$

Correspondingly $g(x_i) = h(x_i) - \bar{h}(x_i) = 0$ for all $x_i \in x_{1:T}$. By construction $P_{1:T} h$ uses $h$ only in
evaluations $h(x_i)$, hence for any two functions $h, h'$ with $h(x_i) = h'(x_i)$ we have $P_{1:T} h = P_{1:T} h'$. Since
$\|P_{1:T}\| \le 1$ it follows that $\|P_{1:T} h\| \le \|h\|_{\mathcal{H}}$. Hence there is no interpolation with norm smaller than
$\|P_{1:T} h\|$.



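Proof [Lemma 2] To see the bound we again use the Cauchy-Schwarz inequality:

$$\begin{aligned}
|h(x) - P_{1:T} h(x)| &= |(1 - P_{1:T}) h(x)| \\
&= |\langle (1 - P_{1:T}) h, \kappa(x, \cdot) \rangle_{\mathcal{H}}| && \text{(by the defining property of } \langle \cdot, \cdot \rangle_{\mathcal{H}}\text{, cf. (Steinwart \& Christmann, 2008), Def. 4.18)} \\
&= |\langle h, (1 - P_{1:T}) \kappa(x, \cdot) \rangle_{\mathcal{H}}| && \text{(since } 1 - P_{1:T} \text{ is an orthogonal projection and so self-adjoint)} \\
&\le \|h\|_{\mathcal{H}}\, \|(1 - P_{1:T}) \kappa(x, \cdot)\| && \text{(by Cauchy-Schwarz)}
\end{aligned}$$

This inequality is clearly tight for $h = (1 - P_{1:T}) \kappa(x, \cdot)$ by the nature of dual norms. Next note that

$$\begin{aligned}
\|(1 - P_{1:T}) \kappa(x, \cdot)\|^2 &= \langle (1 - P_{1:T}) \kappa(x, \cdot), (1 - P_{1:T}) \kappa(x, \cdot) \rangle = \langle \kappa(x, \cdot), (1 - P_{1:T}) \kappa(x, \cdot) \rangle \\
&= \kappa(x, x) - \langle \kappa(x, \cdot), P_{1:T} \kappa(x, \cdot) \rangle = \sigma_T^2(x).
\end{aligned}$$

The second equality follows from the fact that $1 - P_{1:T}$ is idempotent. The last equality follows from the
definition of $P_{1:T}$. The fact that $\sigma_T^2(x)$ is the residual variance of a Gaussian Process regression estimate is
well known in the literature and follows, e.g., from the matrix inversion lemma. ■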

Proof [Lemma 3] The first claim is an immediate consequence of the Lipschitz property of $g$. To see the
second claim we need to establish a number of issues: without loss of generality assume that the maximum
within the convex hull containing $x$ is attained at $x$ (and that the maximum rather than the minimum
denotes the maximum deviation from 0).

The maximum distance of $x$ to one of its vertices is bounded by $\delta/\sqrt{2}$. This is established by considering the
minimum enclosing ball and realizing that the maximum distance is achieved for the regular polyhedron.

To see the maximum deviation from 0 we exploit the fact that $\partial_x g(x) = 0$ by the assumption of $x$ being the
maximum (we need not consider cases where $x$ is on a facet of the polyhedral set since in this case we could
easily reduce the dimensionality). In this case the largest deviation between $g(x)$ and $g(x_i)$ is obtained by
making $g$ a quadratic function $g(x') = \frac{Q'}{2}\|x' - x\|^2$. At distance $\frac{\delta}{\sqrt{2}}$ the function value is bounded by $\frac{Q'\delta^2}{4}$.
Since the latter bounds the maximum deviation it does bound it for $g$ in particular. This proves the claim. ■

Proof [Proposition 4] Let $\mathcal{H}$ be the RKHS corresponding to $\kappa$ and $h \in \mathcal{H}$ an arbitrary element, with
$g := (1 - P_{1:T}) h$ the residual defined in Lemma 1.5. By Lemma 1.3, we know that $\|1 - P_{1:T}\| \le 1$ and so
we have

$$\|g\|_{\mathcal{H}} \le \|1 - P_{1:T}\|\, \|h\|_{\mathcal{H}} \le \|h\|_{\mathcal{H}}. \tag{10}$$

Moreover, by Lemma 1.2, we know that the second derivative of $g$ is bounded by $\|g\|_{\mathcal{H}} Q$, and since by
Lemma 1.5 we know that $g$ vanishes at each $x_i$, we can use Lemma 3.2 and inequality
(10) to conclude that

$$\begin{aligned}
|h(x) - P_{1:T} h(x)| &= |g(x)| \\
&\le \frac{\|g\|_{\mathcal{H}}\, Q \delta^2}{4} && \text{by Lemma 3.2} \\
&\le \frac{\|h\|_{\mathcal{H}}\, Q \delta^2}{4} && \text{by inequality (10)}
\end{aligned}$$

and so for all $x \in \mathcal{D}$ we have

$$|h(x) - P_{1:T} h(x)| \le \frac{Q \delta^2}{4} \|h\|_{\mathcal{H}}. \tag{11}$$

On the other hand, by Lemma 2, we know that for all $x \in \mathcal{D}$ we have the following tight bound:

$$|h(x) - P_{1:T} h(x)| \le \sigma_T(x)\, \|h\|_{\mathcal{H}}. \tag{12}$$

Now, given the fact that both inequalities (11) and (12) are bounding the same quantity and that the latter
is a tight estimate, we necessarily have that

$$\sigma_T(x)\, \|h\|_{\mathcal{H}} \le \frac{Q \delta^2}{4} \|h\|_{\mathcal{H}}.$$

Canceling $\|h\|_{\mathcal{H}}$ gives the desired result.



5.2. Finiteness of Regret 

We begin with two lemmas from (Srinivas et al., 2010): 

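Lemma 8 (Lemma 5.1 of (Srinivas et al., 2010)) Given any finite set $\mathcal{L}$, any sequence of points
$\{x_1, x_2, \ldots\} \subseteq \mathcal{L}$ and $f : \mathcal{L} \to \mathbb{R}$ a sample from $\mathrm{GP}(0, \kappa(\cdot, \cdot))$, for all $\alpha \in (0, 1)$, we have

$$P\left\{ \forall x \in \mathcal{L},\ t \ge 1 : |f(x) - \mu_{t-1}(x)| \le \sqrt{\beta_t}\, \sigma_{t-1}(x) \right\} \ge 1 - \alpha,$$

where $\beta_t = 2\ln\left(\frac{|\mathcal{L}| \pi_t}{\alpha}\right)$ and $\{\pi_t\}$ is any positive sequence satisfying $\sum_t \frac{1}{\pi_t} = 1$. Here $|\mathcal{L}|$ denotes the number
of elements in $\mathcal{L}$.

Lemma 9 (Lemma 5.2 in (Srinivas et al., 2010)) Let $\mathcal{L}$ be a non-empty finite set and $f : \mathcal{L} \to \mathbb{R}$ an
arbitrary function. Also assume that there exist functions $\mu, \sigma : \mathcal{L} \to \mathbb{R}$ and a constant $\sqrt{\beta}$, such that

$$|f(x) - \mu(x)| \le \sqrt{\beta}\, \sigma(x) \quad \forall x \in \mathcal{L}. \tag{13}$$

Then, we have

$$r(x) \le 2\sqrt{\beta}\, \sigma(x) \le 2\sqrt{\beta} \max_{\mathcal{L}} \sigma.$$

Definition 10 (Covering Number) Denote by $\mathcal{B}$ a Banach space with norm $\|\cdot\|$. Furthermore denote by
$B \subseteq \mathcal{B}$ a set in this space. Then the covering number $n_\epsilon(B, \mathcal{B})$ is defined as the minimum number of $\epsilon$ balls
with respect to the Banach space norm that are required to cover $B$ entirely.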

Proof [Theorem 5] The proof consists of the following steps:

• Global: We first show that after a finite number of steps the algorithm zooms in on the neighbourhood
$B(x_M, \rho_0)$. This is done by first showing that $\epsilon$ can be chosen small enough to squeeze the set
$f^{-1}((f_M - \epsilon, f_M])$ into any arbitrarily small neighbourhood of $x_M$ and that as the function is sampled more and
more densely, the UCB-LCB envelope around $f$ becomes arbitrarily tight, hence eventually fitting the
relevant set inside a small neighbourhood of $x_M$. Please refer to Figure 4 for a graphical depiction of
this process.

$G_I$: Since $\mathcal{D}$ is compact and $f$ is continuous and has a unique maximum, for every $\rho > 0$, we can find
an $\epsilon = \epsilon(\rho) > 0$ such that

$$f^{-1}((f_M - \epsilon, f_M]) \subseteq B(x_M, \rho),$$

where $f_M = \max_{\mathcal{D}} f$.

To see this, suppose on the contrary that there exists a radius $\rho > 0$ such that for all $\epsilon > 0$ we have

$$f^{-1}((f_M - \epsilon, f_M]) \not\subseteq B(x_M, \rho),$$

which means that there exists a point $x \in \mathcal{D}$ such that $f(x_M) - f(x) < \epsilon$ but $\|x - x_M\| > \rho$. Now,
for each $i \in \mathbb{N}$, pick a point $x^i \in f^{-1}((f_M - \frac{1}{i}, f_M]) \setminus B(x_M, \rho)$: this gives us a sequence of points
$\{x^i\}$ in $\mathcal{D}$, which by the compactness of $\mathcal{D}$ has a convergent subsequence $\{x^{i_k}\}$, whose limit we will
denote by $x^*$. From the continuity of $f$ and the fact that $f(x_M) - f(x^i) < \frac{1}{i}$, we can conclude that
$f(x_M) - f(x^*) = 0$, which contradicts our assumption that $f$ has a unique global maximum since
we necessarily have $x^* \notin B(x_M, \rho)$.

$G_{II}$: Define $\epsilon^* := \frac{\epsilon(\rho_0)}{4}$, with $\rho_0$ as in Condition (†) of the statement of Theorem 5.
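$G_{III}$: For each $T$, define the "relevant set" $\mathcal{R}_T \subseteq \mathcal{D}$ as follows:

$$\mathcal{R}_T = \left\{ x \in \mathcal{D} \;\middle|\; \mu_T(x) + \sqrt{\beta_T}\,\sigma_T(x) \ge \sup_{\mathcal{D}} \bigl(\mu_T - \sqrt{\beta_T}\,\sigma_T\bigr) \right\}.$$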

$G_{IV}$: Choose $\beta_T = b\ln(T)$, with $b$ chosen large enough to satisfy the conditions of Lemma 8. Then, it is
possible to sample $f$ densely enough so that

$$\sqrt{\beta_T}\,\max \sigma_T(x) < \epsilon^*, \qquad (14)$$

so that $\mathcal{R}_T \subseteq B(x_M, \rho_0)$. This is because as $\mathcal{D}$ is sampled more and more densely we have $\sigma = O(\delta^2)$,
where $\delta$ is the distance between the points of the grid, and $\beta = O(\ln\frac{1}{\delta}) = O(-\ln\delta)$ and so
$\sqrt{\beta}\,\sigma \to 0$ as $\delta \to 0$, and so there exists a $\delta_0$ small enough so that a lattice of resolution $\delta_0$ would
give us the bound given in inequality (14). The end point of this process is depicted in Figure
4, where the relevant set $\mathcal{R}_T$ lies inside the non-shaded region: the reason for this inclusion and
"thickness" $4\epsilon^*$ is described below, in Step $L_1$ of the proof: cf. Equation (15).

• Local: Once the algorithm has localized attention to a neighbourhood of $x_M$, then we can show that the
regret decreases exponentially; to do so, we will proceed by sampling the relevant set twice as densely
and shrinking the relevant set, and repeating these two steps. The claim is that in each iteration, the
maximum regret goes down exponentially and the number of the new points that are sampled in each
refining iteration is asymptotically constant. To prove this, we will write down the equations governing
the behaviour of the number of sampled points and $\sigma$. We will adopt the following notation to carry
out this task:



Figure 4. The elimination of other smaller peaks. (Legend: $\mu \pm B\sigma$; $f$; $f \pm 2\epsilon^*$; sampled points $(x_i, f(x_i))$; discarded regions; envelope thickness $4\epsilon^*$.)

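– $\delta_\ell$: the resolution of the lattice of sampled points at the end of the $(\ell+1)$th refining iteration inside
$\mathcal{R}_{\ell+1}$ (defined below).

– $\epsilon_\ell = \sup \sigma_{N_\ell}(x)$ at the end of the $\ell$th iteration. Note that $\epsilon_\ell \propto \delta_\ell^2$. Also, note that $\epsilon_0 < \epsilon^*$ by the
choice of $\delta_0$.

– $N_\ell$: the number of points that have been sampled by the end of the $\ell$th iteration.

– $\Delta N_\ell = N_{\ell+1} - N_\ell$.

– $\mathcal{R}_\ell$: the relevant set at the beginning of the $\ell$th iteration. Note that $\mathcal{R}_\ell \subseteq B(x_M, \rho_0)$.

– $\rho_\ell = \frac{\operatorname{diam}(\mathcal{R}_\ell)}{2}$. Note that $\rho_\ell \le \rho_0$.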



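$L_1$:

$$\begin{aligned}
N_1 &\le N_0 + n_{\delta_0}\bigl(\mathcal{R}_0, (\mathbb{R}^d, \|\cdot\|_2)\bigr) \\
&\le N_0 + \mathcal{N}^d(\rho_0, \delta_0) \\
&\le N_0 + \mathcal{N}^d\!\left(\sqrt{\frac{4\sqrt{\beta_{N_0}}\,\epsilon_0}{c_2}},\ \delta_0\right) \\
&= N_0 + \mathcal{N}^d\!\left(c\,\sqrt{\epsilon_0}\,(\ln N_0)^{1/4},\ \delta_0\right),
\end{aligned}$$

where $n_{\delta_0}(\mathcal{R}_0, (\mathbb{R}^d, \|\cdot\|_2))$ is the $\delta_0$-covering number as defined in Definition 10,
$\mathcal{N}^d(\rho_0, \delta_0) := n_{\delta_0}\bigl(B(0, \rho_0), (\mathbb{R}^d, \|\cdot\|_2)\bigr)$, and $c = \sqrt{4\sqrt{b}/c_2}$.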



The expression $\sqrt{4\sqrt{\beta_{N_0}}\,\epsilon_0 / c_2}$ comes about as follows: using the notations $B = \sqrt{\beta_{N_0}}$ and $\sigma = \sigma_{N_0}$,
we know by Lemma 8 that $f$ and $\mu$ are intertwined with each other in the sense that both of the
following chains of inequality hold:

$$\mu - B\sigma \le f \le \mu + B\sigma$$
$$f - B\sigma \le \mu \le f + B\sigma,$$

which, combined together, give us the following chain of inequalities

$$f - 2B\sigma \le \mu - B\sigma \le f \le \mu + B\sigma \le f + 2B\sigma. \qquad (15)$$

Since we also know that $\sigma(x) \le \epsilon_0$ for all $x \in \mathcal{R}_0$, we can conclude that

$$f - 2B\epsilon_0 \le \mu - B\sigma \le \mu + B\sigma \le f + 2B\epsilon_0.$$

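Moreover, if condition (†) holds, we know that in $\mathcal{R}_0$, the function $f$ satisfies $-c_1 r^2 \le f(x) - f(x_M) \le -c_2 r^2$,
where $r = r(x) := \|x - x_M\|$, so we get that

$$f(x_M) - c_1 r^2 - 2B\epsilon_0 \le \mu - B\sigma \le \mu + B\sigma \le f(x_M) - c_2 r^2 + 2B\epsilon_0.$$

Now, recall that $\mathcal{R}_0$ is defined to consist of points $x$ where $\mu(x) + B\sigma(x) \ge \sup_{\mathcal{D}} \bigl(\mu - B\sigma\bigr)$, but given
the fact that we have the above outer envelope for $\mu \pm B\sigma$, we can conclude that

$$\begin{aligned}
\mathcal{R}_0 &\subseteq \left\{ x \;\middle|\; f(x_M) - c_2 r(x)^2 + 2B\epsilon_0 \ge \max_x \bigl( f(x_M) - c_1 r(x)^2 - 2B\epsilon_0 \bigr) \right\} \\
&= \left\{ x \;\middle|\; f(x_M) - c_2 r(x)^2 + 2B\epsilon_0 \ge f_M - 2B\epsilon_0 \right\} \\
&= \left\{ x \;\middle|\; -c_2 r(x)^2 + 2B\epsilon_0 \ge -2B\epsilon_0 \right\} \\
&= \left\{ x \;\middle|\; c_2 r(x)^2 \le 4B\epsilon_0 \right\}.
\end{aligned}$$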

Now, if, on the other hand, $f$ satisfies condition (‡), then by the smoothness assumptions in (‡),
we know that $\nabla f(x_M)$ is perpendicular to $\partial\mathcal{D}$ at $x_M$ and so there exist positive numbers $c_1$ and $c_2$
such that in a neighbourhood of $x_M$ we have

$$-c_1 r \le f - f(x_M) \le -c_2 r.$$

Note that in the argument above in the case of (†), the precise form of the lower bound on $f$ was
irrelevant, since all we are interested in is its maximum. So, the same argument goes through again.
This is depicted in Figure 5, where $B := \sqrt{\beta_{N_0}} = \sqrt{b \ln N_0}$.
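$L_{\ell+1}$: Now, let us suppose that we are at the end of the $\ell$th iteration. We have

$$\begin{aligned}
N_{\ell+1} &\le N_\ell + \mathcal{N}^d\!\left(c\,\sqrt{\epsilon_\ell}\,(\ln N_\ell)^{1/4},\ \delta_\ell\right) && \text{by Proposition 4} \\
&\le N_\ell + C(\ln N_\ell)^{d/4} && \text{since } \mathcal{N}^d(2\rho, 2\delta) \sim \mathcal{N}^d(\rho, \delta) \text{ for any } \rho \text{ and } \delta.
\end{aligned}$$

So, the number of samples needed by the branch and bound algorithm is governed by the difference
inequality

$$\Delta N_\ell \le C(\ln N_\ell)^{d/4}. \qquad (16)$$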
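
To make the origin of the $(\ln N_\ell)^{d/4}$ factor concrete, here is a minimal sketch that uses a simple grid cover as an upper bound on the covering number. This is an illustrative stand-in rather than the exact covering number $\mathcal{N}^d$ of the analysis; the function names and the constant `c` are hypothetical, and the radius $c\,\delta\,(\ln N)^{1/4}$ relies on $\sqrt{\epsilon_\ell} \propto \delta_\ell$.

```python
import math

def grid_cover_count(rho, delta, d):
    """Upper bound on the number of Euclidean delta-balls covering B(0, rho) in R^d.

    The bounding cube [-rho, rho]^d is tiled with cubes of side 2*delta/sqrt(d);
    each such cube has half-diagonal delta, so it fits inside one delta-ball.
    """
    per_axis = math.ceil(rho * math.sqrt(d) / delta)
    return per_axis ** d

def new_points_bound(N, delta, d, c=1.0):
    """Grid-cover count for radius c*delta*(ln N)**0.25 at resolution delta.

    c is a hypothetical constant standing in for the constant of the proof.
    """
    return grid_cover_count(c * delta * math.log(N) ** 0.25, delta, d)

# Doubling both the radius and the resolution leaves the count unchanged.
print(grid_cover_count(0.3, 0.01, 3) == grid_cover_count(0.6, 0.02, 3))
# The count grows like (ln N)^(d/4) and does not shrink with delta.
print(new_points_bound(1e6, 0.1, 2), new_points_bound(1e6, 0.001, 2))
```

Because the ratio of radius to resolution is all that enters the count, refining the lattice adds only on the order of $(\ln N_\ell)^{d/4}$ new points per iteration in this sketch.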



Figure 5. The shrinking of the relevant set $\mathcal{R}_\ell$. Here, $B = \sqrt{\beta_{N_0}}$. (Legend: $\sigma$; $\mu \pm B\sigma$; $f$; $f \pm 2B\epsilon_0$; discarded regions.)



To study the solutions of this difference equation, we consider the corresponding differential equation:

$$\frac{dN}{d\ell} = C(\ln N)^{d/4}. \qquad (17)$$

Since this equation is separable, we can write

$$\frac{dN}{(\ln N)^{d/4}} = C\,d\ell.$$

Now, letting $\ell = L$ be a given number of iterations in the algorithm and $N(L)$ the corresponding
number of sampled points, we can integrate both sides of the above equation to get

$$\int_{N(0)}^{N(L)} \frac{dN}{(\ln N)^{d/4}} = \int_0^L C\,d\ell = CL.$$

Since the integral on the left cannot be evaluated in closed form, we will use the lower bound

$$\frac{N(L) - N(0)}{(\ln N(L))^{d/4}} \le \int_{N(0)}^{N(L)} \frac{dN}{(\ln N)^{d/4}},$$

which holds because the integrand is decreasing in $N$, to get

$$\frac{N(L) - N(0)}{C(\ln N(L))^{d/4}} \le L. \qquad (18)$$
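
As a sanity check on inequality (18), the following sketch iterates the difference relation (16) with equality and compares the resulting sample count with the bound. It checks the discrete recursion rather than the differential equation (17), and the values of `N0`, `C`, `d` and `L` are hypothetical.

```python
import math

def simulate_samples(N0, C, d, L):
    """Iterate N_{l+1} = N_l + C*(ln N_l)**(d/4), i.e. (16) with equality.

    Returns the list [N_0, ..., N_L]. Requires N0 > 1 so that ln N_l > 0.
    """
    N = [float(N0)]
    for _ in range(L):
        N.append(N[-1] + C * math.log(N[-1]) ** (d / 4))
    return N

def iterations_lower_bound(N0, NL, C, d):
    """Left-hand side of (18): (N(L) - N(0)) / (C * (ln N(L))**(d/4))."""
    return (NL - N0) / (C * math.log(NL) ** (d / 4))

# Hypothetical constants, chosen only for illustration.
N = simulate_samples(N0=10, C=3.0, d=2, L=50)
print(iterations_lower_bound(N[0], N[-1], C=3.0, d=2) <= 50)
```

The bound holds for the recursion because every increment $C(\ln N_\ell)^{d/4}$ with $\ell < L$ is at most $C(\ln N_L)^{d/4}$, which mirrors the monotonicity argument used for the integral.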



Given a time $t$, we will denote by $\ell_t$ the largest non-negative integer such that $N_{\ell_t} \le t$, or $0$ if no
such number exists. We illustrate this somewhat obtuse definition with the following example:
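
A minimal sketch of this definition, using a hypothetical sequence of cumulative sample counts $N_\ell$ (the numbers are illustrative only and do not come from the analysis). The inequality is read as non-strict, $N_{\ell_t} \le t$, which is an assumption chosen so that $t < N_{\ell_t+1}$ holds whenever $t \ge N_0$.

```python
def largest_iteration(N, t):
    """Return l_t: the largest l >= 0 with N[l] <= t, or 0 if there is none.

    N is a non-decreasing list [N_0, N_1, ...] of cumulative sample counts.
    """
    l_t = 0
    for l, n in enumerate(N):
        if n <= t:
            l_t = l
    return l_t

# Hypothetical sample counts at the end of iterations 0, 1, 2, 3.
N = [10, 18, 29, 43]
for t in (5, 20, 29, 100):
    print(t, largest_iteration(N, t))
```

For $t = 20$, for instance, the sketch returns $\ell_t = 1$, since $N_1 = 18 \le 20 < N_2 = 29$.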
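Now, by Lemma 9, for all $t > N_0$ we have

$$\begin{aligned}
r_t &\le 2\sqrt{\beta_t}\,\max \sigma_t \le 2\sqrt{b\ln t}\;\epsilon_{\ell_t} \le \frac{2\sqrt{b\ln t}\;\epsilon_0}{4^{\ell_t}} = 8\epsilon_0\sqrt{b\ln t}\left(\frac{1}{4}\right)^{\ell_t+1} \\
&\le 8\epsilon_0\sqrt{b\ln t}\left(\frac{1}{4}\right)^{\frac{N_{\ell_t+1} - N_0}{C(\ln N_{\ell_t+1})^{d/4}}} && \text{by Equation 18} \\
&\le 8\epsilon_0\sqrt{b\ln t}\left(\frac{1}{4}\right)^{\frac{D N_{\ell_t+1}}{(\ln N_{\ell_t+1})^{d/4}}} && \text{for some } D > 0 \text{ since } N_{\ell_t+1} > N_0 \\
&\le 8\epsilon_0\sqrt{b\ln t}\left(\frac{1}{4}\right)^{\frac{Dt}{(\ln t)^{d/4}}} && \text{for } t \text{ satisfying } \ln t > \tfrac{d}{4} \text{ (see } \star \text{ below) since } t < N_{\ell_t+1} \\
&= 8\epsilon_0\sqrt{b}\; e^{-\frac{Et}{(\ln t)^{d/4}} + \frac{1}{2}\ln\ln t} && \text{for } E = D\ln 4 \\
&\le 8\epsilon_0\sqrt{b}\; e^{-\frac{Et}{2(\ln t)^{d/4}}} && \text{for large enough } t \\
&= A\,e^{-\frac{\tau t}{(\ln t)^{d/4}}} && \text{for } A = 8\epsilon_0\sqrt{b} \text{ and } \tau = E/2.
\end{aligned}$$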

$\star$ The reason for the specific criterion $\ln t > \frac{d}{4}$ is that the function $\frac{x}{(\ln x)^{d/4}}$ is increasing when this
condition is satisfied, and so decreasing $x$ from $N_{\ell_t+1}$ to $t$ decreases its value, increasing the
overall expression $\left(\frac{1}{4}\right)^{\frac{Dx}{(\ln x)^{d/4}}}$. To see that $\frac{x}{(\ln x)^{d/4}}$ becomes increasing when $\ln x > \frac{d}{4}$, we simply
need to calculate its derivative:

$$\frac{d}{dx}\,\frac{x}{(\ln x)^{d/4}} = \frac{1}{(\ln x)^{d/4}} - \frac{d}{4}\cdot\frac{1}{(\ln x)^{d/4+1}} = \frac{\ln x - \frac{d}{4}}{(\ln x)^{d/4+1}}.$$

Moreover, since $N_{\ell_t+1} > t$, if the derivative of $\frac{x}{(\ln x)^{d/4}}$ is positive at $t$, it is also positive between
$t$ and $N_{\ell_t+1}$ and so the function is indeed increasing in that interval.
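
The monotonicity claim can also be checked numerically. The sketch below compares the closed-form derivative $(\ln x - d/4)/(\ln x)^{d/4+1}$ with a central finite difference and inspects its sign on either side of the threshold $\ln x = d/4$; the function names, the dimension $d = 8$ and the sample points are hypothetical choices.

```python
import math

def g(x, d):
    """The exponent profile x / (ln x)**(d/4), defined for x > 1."""
    return x / math.log(x) ** (d / 4)

def g_prime(x, d):
    """Closed-form derivative (ln x - d/4) / (ln x)**(d/4 + 1)."""
    return (math.log(x) - d / 4) / math.log(x) ** (d / 4 + 1)

def finite_difference(x, d, h=1e-5):
    """Central finite-difference approximation of the derivative of g."""
    return (g(x + h, d) - g(x - h, d)) / (2.0 * h)

d = 8  # hypothetical dimension; the threshold is then ln x = 2
for x in (5.0, 8.0, 20.0, 100.0):
    print(x, math.log(x) > d / 4, g_prime(x, d) > 0)
```

With these values the derivative is negative at $x = 5$, where $\ln x < 2$, and positive at the three larger points, in line with the criterion $\ln x > d/4$.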



